I. Introduction

In the new era of Internet and technology, there are more and more startups now operating from over the world.Some of them are really successful but the majority of the startups failed and collapsed after time. However, some startups or small companies are taken or acquired by other big corporations to keep developing. For example, Instagrams and WhatsApp are the most popular companies acquired by big companies and keep growing out successfully. One of the challenges for the startups, investments or acquirers are how they can quantify or predict the price of that company. This study aims to understand the relationship between the price of the company in the acquisition with the total fundings which they received from the investors, the number of funding rounds, how long they are operating from the founded year and the difference between other regions. A model will be constructed to depict this relationship.

The data used in this study is obtained from CrunchBase company. CrunchBases is owned and operated by TechCrunch. This is a platform for finding business information about private and public companies. In Crunchbase, we can find the investments and funding information, founding members and individuals in leadership positions, mergers and acquisition, news and industry trends. The dataset is downloaded from https://public.tableau.com/en-us/s/resources.

The continuous variable total fundings, number of funding rounds and number of operations years are used to quantify the Price in USD of any company. The total funding represents the money for a startup that can prove the worthiness of its model and products. At the beginning, it is impossible to prove any hypothetical business to success. The total funding in this dataset is the total amount money that a company receives during their operation time. The total funding is the sum of the money from the number of funding rounds. There are other types of funding rounds available to any startup. And it depends on the industry and the level of interest among potential investors. In the business world, the “seed” funding round is popular at the beginning. Next, these funding rounds can be followed by Series A,B and C funding rounds. According to Investopedia, an acquisition is when one company purchases most or all of another company’s share to gain the control of that company. Acquisitions are very common in the business world, it may occur with the target company’s approval, or in spite of its disapproval. Because CrunchBase is the business database company that keeps track of all the information companies worldwide, the acquired prices are based on several currencies. So every price is converted to USD in the exchange rate in 2014 which is the same time as the updated database. There is limited information or data that can fully describe the value of a company on the market that can help us evaluate the value of the company. The goal of this study is to build a model that can help to predict the price of a company in USD. The total funding that a company receives from investors may be related to the price value of the company in the market. Moreover, the number of funding rounds and operation years may contribute in explaining the variability in the acquired prices of a company.

After organizing the data, there are 868 companies represented in the data set.

Table 1.Sample for 5 randomly chosen companies of the data set used in this study
company_name price_amount_USD funding_total_usd acquirer_country_code funding_rounds num_years
Admeld 362364800 3.0e+07 USA 3 7
Optimal, Inc.  35000000 7.6e+06 USA 3 6
kaufDA 40000000 0.0e+00 DEU 2 6
Caviar 90000000 1.5e+07 USA 2 2
SpaceClaim 85000000 4.9e+07 USA 5 9


II. Exploratory data analysis


Table 2: Summary for the Price amount in USD
n min median mean max sd
868 64.26034 101500000 335657357 1.9e+10 968969461
Figure 1. Distribution for the Price in USD for individual companies

Figure 1. Distribution for the Price in USD for individual companies

Figure 1 Our total sample size is 868 (Table 2). The mean of the Price in USD is about 335,657,357 USD, which is far greater than the median 101,500,000, indicating that the Price in USD distribution is heavily right-skewed. It can easily be observed in Figure 1. Most of the acquired companies have their price within 5.0e+09-mark. I also notice the existence of some very extreme cases in this plot (outliers).

Figure 2. Distribution for the Price in USD for individual companies

Figure 2. Distribution for the Price in USD for individual companies

Figure 2 The distribution of the Total funding in USD (Figure 2), which measures the total money a company receives in their funding period, seems to be right-skewed. Most of the acquired companies have the total funding in USD within 5.0e+08-mark. It also noticeable the existence of some very extreme cases in the figure 2 (outliers).

Figure 3. Distribution for the number of operation years for individual companies

Figure 3. Distribution for the number of operation years for individual companies

Figure 3 The distribution of the number of operation years (until the year of 2014) seems to resemble a bell shape although a significant right-skewed due to the existence of some extreme cases. The majority of the acquired companies are within 40 years of operation.

Figure 4. Distribution for the number of funding rounds for individual companies

Figure 4. Distribution for the number of funding rounds for individual companies

Figure 4 The distribution of the number of funding rounds seems to be right-skewed. The majority of the companies have the number of funding rounds within 8 rounds.

Figure 5. Scatter plot between the Pricing in USD and total funding in USD

Figure 5 The scatterplot shows that there seems to be some correlation between the Price in USD and the total funding in USD, without implying any causal effect, companies with a higher total money funding tend to also have been evaluated in high value ( price in USD) on the market. I also notice there are several extreme cases in the plot that significantly affects the trend of the linear line. One company called Whatsapp is taken by Facebook and only receives a little amount of funding but a remarkably acquired price amount in USD. And the company called Fisker Automotive receives a huge amount of funding but low in acquired price. Those two companies consider the outliers but those are valid and cannot delete out of the model.

Figure 6. Scatter plot between the Pricing in USD and number of funding rounds

Figure 6. The scatter plot shows that there seems to be a slight correlation between the Price in USD and the number of funding rounds.

Figure 7. Scatter plot between the Pricing in USD and number operation years

Figure 7. The scatter plot shows that there seems to be a slight correlation between the Price in USD and the number of operation years

Figure 7.1 Boxplot of relationship between  the price_amount_USD for a company for the acquirer_country_code

Figure 7.1 Boxplot of relationship between the price_amount_USD for a company for the acquirer_country_code

III. Multiple linear regression

i. Methods

The initial models is the following:


## 
## Call:
## lm(formula = price_amount_USD ~ funding_total_usd + funding_rounds + 
##     num_years + acquirer_country_code, data = tidy_joined_dataset)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.457e+09 -2.595e+08 -1.617e+08  3.184e+06  1.867e+10 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -7.113e+06  9.742e+08  -0.007  0.99418    
## funding_total_usd         2.414e+00  4.779e-01   5.051  5.4e-07 ***
## funding_rounds           -1.614e+07  2.262e+07  -0.714  0.47569    
## num_years                 1.366e+07  4.901e+06   2.788  0.00543 ** 
## acquirer_country_codeBEL -2.171e+08  1.186e+09  -0.183  0.85479    
## acquirer_country_codeBMU  2.557e+08  1.366e+09   0.187  0.85157    
## acquirer_country_codeCAN -8.857e+07  9.967e+08  -0.089  0.92921    
## acquirer_country_codeCHE  3.130e+07  1.026e+09   0.031  0.97566    
## acquirer_country_codeCHN  3.875e+06  1.009e+09   0.004  0.99694    
## acquirer_country_codeCZE -1.961e+08  1.180e+09  -0.166  0.86807    
## acquirer_country_codeDEU  4.660e+07  1.009e+09   0.046  0.96319    
## acquirer_country_codeESP  1.350e+07  1.060e+09   0.013  0.98984    
## acquirer_country_codeFIN -2.577e+08  1.361e+09  -0.189  0.84994    
## acquirer_country_codeFRA  3.722e+08  1.005e+09   0.370  0.71113    
## acquirer_country_codeGBR -7.019e+07  9.877e+08  -0.071  0.94336    
## acquirer_country_codeGRC -2.456e+08  1.367e+09  -0.180  0.85743    
## acquirer_country_codeHKG -4.782e+07  1.361e+09  -0.035  0.97198    
## acquirer_country_codeIND -1.107e+08  1.114e+09  -0.099  0.92088    
## acquirer_country_codeIRL -1.398e+07  1.034e+09  -0.014  0.98921    
## acquirer_country_codeISR -8.724e+07  1.014e+09  -0.086  0.93146    
## acquirer_country_codeITA -1.634e+08  1.366e+09  -0.120  0.90479    
## acquirer_country_codeJPN  2.627e+08  9.919e+08   0.265  0.79117    
## acquirer_country_codeKOR  6.619e+07  1.114e+09   0.059  0.95265    
## acquirer_country_codeLUX  8.845e+07  1.361e+09   0.065  0.94819    
## acquirer_country_codeMYS -1.248e+08  1.181e+09  -0.106  0.91586    
## acquirer_country_codeNLD  7.912e+05  1.118e+09   0.001  0.99944    
## acquirer_country_codeNOR -7.765e+07  1.112e+09  -0.070  0.94436    
## acquirer_country_codeRUS -5.821e+07  1.364e+09  -0.043  0.96597    
## acquirer_country_codeSGP  5.554e+07  1.179e+09   0.047  0.96245    
## acquirer_country_codeSWE -1.279e+08  1.183e+09  -0.108  0.91396    
## acquirer_country_codeTWN -8.878e+06  1.116e+09  -0.008  0.99366    
## acquirer_country_codeUSA  1.747e+08  9.670e+08   0.181  0.85668    
## acquirer_country_codeZAF  1.928e+08  1.364e+09   0.141  0.88766    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 961500000 on 835 degrees of freedom
## Multiple R-squared:  0.05162,    Adjusted R-squared:  0.01527 
## F-statistic:  1.42 on 32 and 835 DF,  p-value: 0.06287

I intended to use a linear model on the given data, then performed a residual analysis, as an in-sample validation method, to detect any systematic departure from the assumptions upon which the model is built: normality, independence, and homoscedasticity of the residuals.

In Figure 8, the 4 plots present the residual analysis that we can observe the validity of the model with four assumptions of linear model. The first left plot between the fitted value and residuals is violated by the constant variance assumptions. And the several observations seem to significantly affect the 0 horizontal line. Moreover, the Normal Q-Q plot at the top right shows the violation of the normality assumption.

Due to the violation of the normality and homoscedasticity assumption mentioned above, I recognize that a transformation is much needed. Using the method of log-likelihood (Figure 9), our dependent variable (CCC) will be transformed by the factor of 0.10101. As we can observe from the Box-Cox transformation plot below, the optimal value is quite nere 0, which means the log transformation in the Box-Cox function. To simply the transformation, the log transformation will be use for the transformed Price in USD variable

And also from the figure 5 (Scatter plot between the Pricing in USD and total funding in USD), the slope of the trend line in the plot might change in different funding total in USD, because most of the data points are within 5.0e+0.8. I recognized that a linear model might not be the best model to capture this complex behaviour of the given data, so I decided to make use of the natural spline model. Based on the rule of thumb that we covered in class, I decided to have 4 knots in the model.

Figure 8. Residual analysis graphs

Figure 8. Residual analysis graphs

Figure 9. Graph resulting from a Box Cox Test

Figure 9. Graph resulting from a Box Cox Test

From figure 9, the plot has shown more promising results: the top left plot shows the heteroscedasticity is preserved. It does not show any pattern around the 0 horizontal line, and the outliers do not significantly affect the trend of the given data. The normality assumption is much better as we can observe from the top right plot (Normal Q-Q plot). Moreover, the figure 8 shows the distribution of transformed price in USD seems assembly bell shaped than the previous model. It just slightly left-skewed due to some influential points.

Figure 9. Residuals distribution for the new transformed model

Figure 9. Residuals distribution for the new transformed model

Figure 9. Residuals distribution for the new transformed model

Figure 9. Residuals distribution for the new transformed model

VIF table To ensure that multicollinearity is not a problem in the transformed model, the VIF values were calculated for the variables in the transformed model. This indicates that there is a moderate correlation between the predictor variables. Since there is not a lot of multicollinearity between the predictor variables, the statistical power of the model is not greatly reduced.

##                                   GVIF Df GVIF^(1/(2*Df))
## ns(funding_total_usd, df = 4) 1.587440  4        1.059466
## funding_rounds                1.522953  1        1.234080
## num_years                     1.101111  1        1.049338
## acquirer_country_code         1.194533 29        1.003069

ii. Model Results and Interpretation

## 
## Call:
## lm(formula = price_amount_USD ~ ns(funding_total_usd, df = 4) + 
##     funding_rounds + num_years + acquirer_country_code, data = tidy_joined_dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.1705  -0.7559   0.1071   0.9109   4.6616 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    17.367302   1.639151  10.595  < 2e-16 ***
## ns(funding_total_usd, df = 4)1  1.518374   0.190354   7.977 4.97e-15 ***
## ns(funding_total_usd, df = 4)2  6.011536   0.841266   7.146 1.96e-12 ***
## ns(funding_total_usd, df = 4)3  3.661928   0.890139   4.114 4.28e-05 ***
## ns(funding_total_usd, df = 4)4  1.419488   1.592235   0.892   0.3729    
## funding_rounds                 -0.084706   0.042493  -1.993   0.0465 *  
## num_years                       0.036363   0.008377   4.341 1.59e-05 ***
## acquirer_country_codeBEL       -1.793152   1.991478  -0.900   0.3682    
## acquirer_country_codeBMU        2.076348   2.294353   0.905   0.3657    
## acquirer_country_codeCAN       -1.586006   1.673598  -0.948   0.3436    
## acquirer_country_codeCHE        0.425500   1.723669   0.247   0.8051    
## acquirer_country_codeCHN       -1.450380   1.694238  -0.856   0.3922    
## acquirer_country_codeCZE       -2.126579   1.980717  -1.074   0.2833    
## acquirer_country_codeDEU       -0.299762   1.695008  -0.177   0.8597    
## acquirer_country_codeESP       -0.481805   1.780442  -0.271   0.7868    
## acquirer_country_codeFIN       -2.268971   2.283324  -0.994   0.3207    
## acquirer_country_codeFRA        0.219254   1.686767   0.130   0.8966    
## acquirer_country_codeGBR       -1.744144   1.658712  -1.052   0.2933    
## acquirer_country_codeGRC       -2.583876   2.295457  -1.126   0.2606    
## acquirer_country_codeHKG       -0.172530   2.282324  -0.076   0.9398    
## acquirer_country_codeIND       -1.185563   1.870943  -0.634   0.5265    
## acquirer_country_codeIRL       -0.357638   1.734366  -0.206   0.8367    
## acquirer_country_codeISR       -1.152517   1.703456  -0.677   0.4989    
## acquirer_country_codeITA       -2.891978   2.292183  -1.262   0.2074    
## acquirer_country_codeJPN        0.668133   1.665102   0.401   0.6883    
## acquirer_country_codeKOR       -0.094334   1.873618  -0.050   0.9599    
## acquirer_country_codeLUX        0.575647   2.285155   0.252   0.8012    
## acquirer_country_codeMYS       -2.290794   1.982772  -1.155   0.2483    
## acquirer_country_codeNLD       -0.449031   1.875241  -0.239   0.8108    
## acquirer_country_codeNOR       -1.682870   1.867278  -0.901   0.3677    
## acquirer_country_codeRUS        0.526924   2.291938   0.230   0.8182    
## acquirer_country_codeSGP        0.010985   1.979101   0.006   0.9956    
## acquirer_country_codeSWE       -1.718902   1.986818  -0.865   0.3872    
## acquirer_country_codeTWN       -0.584093   1.873219  -0.312   0.7553    
## acquirer_country_codeUSA       -0.159879   1.624061  -0.098   0.9216    
## acquirer_country_codeZAF        1.869768   2.290738   0.816   0.4146    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.612 on 832 degrees of freedom
## Multiple R-squared:  0.2924, Adjusted R-squared:  0.2627 
## F-statistic: 9.825 on 35 and 832 DF,  p-value: < 2.2e-16

Given the nature of splines, the interpretability of this model coefficient is deemed futile. The goal of this model is to help us predict the price of the company in the market in the acquisition process. However, the price in USD is in the log transformed and we need to transform back if we need a meaningful outcome.To focus on examining the coefficients and their relative significance compared to other models relies on Omnibus test results that we will go over in the ANOVA table analysis section.

However, in the result summary table of the transformed model, the funding_total_inUSD, funding rounds, number of operation years have the p-value <0.05 and it each of this variable shows the statistically significant result with the transformed Price in USD if it stand alone in the model

Whereas the acquirer country were found to be insignificant with p-values > 0.05 for each categorical level.

Seeing the adjusted R-squared of 0.2627 using our model, I found that it explains quite a lot of variability of the transformed price in USD of each company, coupled with the significance of the predictors and very low p-value of 2.2e-16 for our model. It leads to the conclusion that this model is helpful in its explanatory ability.


iii. Inference for multiple regression

Table 6. ANOVA Table
Df Sum Sq Mean Sq F value Pr(>F)
ns(funding_total_usd, df = 4) 4 627.9451 156.9863 60.4081 0.0000
funding_rounds 1 22.6818 22.6818 8.7279 0.0032
num_years 1 47.4809 47.4809 18.2706 0.0000
acquirer_country_code 29 195.5342 6.7426 2.5945 0.0000
Residuals 832 2162.1702 2.5988 NA NA

Interpretation of the ANOVA table result from table 6:

The funding total USD with 3 knots with 4 degrees of freedom keeps adding 627.9451 sum of squares. With an F value = 60.4081 and p-value = 0.0001, we can conclude that the Stringency Index alone in the model explains a significant amount of variability of transformed price in USD.

The number of funding rounds with 1 degree of freedom keeps adding 22.6818 sum of squares. With an F value = 8.7279 and p-value = 0.0032, we can conclude that the model with the number of funding rounds,given that the funding total USD is in the model, is statistically significant.

The number of operation years with 1 degree of freedom keeps adding 47.4809 sum of squares .With an F value = 8.18.2706 and p-value = 0.0001, we can conclude that the model with the number of operation years, given that the funding total USD and number of funding rounds are in the model, is statistically significant.

The acquirer countries with 29 degrees of freedom keep adding 195.5342 sum of squares. With an F value =2.5945 and p-value= 0.0001, we can conclude that the model with the acquirer countries, given that the funding total USD, number of funding rounds and number of operation years, is statistically significant.

For the 95% Prediction Intervals, any company with funding total in USD equals to its median 1.500e+07, the number of funding rounds equal to its median 2 rounds, the acquirer is from USA and they have 1 year in the market (operation time), their log of Price in USD can be predicted is 17.95792 with the lower limit 14.78480 and upper limit 21.13104.

With those companies holding the same value funding total equals to its median 1.500e+07 in USD, the number of funding rounds equal to its median 2 rounds and the acquirer is from USA. The Prediction Interval below shows the predicted log of Price in USD for number of operation years (at 1,5,10,15).

Table 7. The 95% Prediction intervals for the Price in USD, where number of operationn years = 1,5, 10, 15, respectively, for funding_total_usd equal to its median = 1.500e+07, funding_rounds equal to its median = 2, and the acquirer_country is USA.
num_years Point Estimate Lower Limit Upper Limit
1 17.95792 14.78480 21.13104
5 18.10337 14.93315 21.27359
10 18.28519 15.11667 21.45371
15 18.46700 15.29806 21.63595

IV. Discussion

i. Conclusions

In conclusion, the interpretability is sometimes to be traded for the sake of a better model. The analysis shows that the model I proposed seems to be helpful as it explains quite good amount of variability in Price in USD (about 29.24%)

The analysis also shows some evidence to suggest that the Price in USD of a company in acquisition transaction is positively correlated with total funding in USD, which aligns with my expectation, for the more of the funding could somehow represent for the potential and worthiness of that company. Moreover, it is also positively correlated with the number of funding rounds and number of operation years in spite of that relationships are quite not strong in my expectation.

ii. Limitations

First limitation I encountered with this predictive model is the data available. The data set is published by Crunch Based company and the it does not show in details some interesting variables that I think it will contribute in the model. Moreover, some notable outliers and points with high leverage that valid and cannot be eliminate from the model. It leads to the significantly effect on the power of the model.

The categorical variable (accquirer_country_code) has many levels and I do not have time and knowledge to handle this situation. Some strategies in the Internet with the help with rigde or lasso Regression, but I do not think it would help me in this situation.

The choice to use a non-linear model made the interpretation of the relationship between the variables more complex and less straightforward, which is not a bad thing when used appropriately. However, no test was done to check for overfitting, so the adequacy of the complexity of the model cannot be determined.

iii. Further questions

As I mentioned above, it need some technique to deal with the categorical variable with many levels.